White Wine Quality Analysis by Min Lai

========================================================

I explored the white wine quality data set. The sequence number column in the original data set was removed because it only indexes the rows and carries no information about the wines.
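The loading step is not shown in this report, so the following is only a minimal sketch of how the sequence number could be dropped. The column name `X` and the temporary CSV file are assumptions made for illustration; the three rows are copied from the `str()` output below.

```r
# Hypothetical stand-in for the raw file: three rows taken from the str()
# output, plus an assumed sequence-number column named "X".
raw <- data.frame(
  X                = 1:3,
  fixed.acidity    = c(7, 6.3, 8.1),
  volatile.acidity = c(0.27, 0.30, 0.28),
  quality          = c(6L, 6L, 6L)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(raw, csv_path, row.names = FALSE)

# Read the file back and drop the sequence number, which is only a row index.
whitewines <- read.csv(csv_path)
whitewines$X <- NULL
str(whitewines)
```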

## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000

According to the summary, this data set has 4898 white wine samples. Here are some observations:

  1. Quality ranges from 3 to 9 with a median of 6. The 3rd quartile is also 6, so at least 75% of the wines score 6 or lower, and since the 1st quartile is 5, at least half of the wines have quality 5 or 6.

  2. Fixed acidity, volatile acidity and citric acid all have very wide ranges. For example, fixed acidity ranges from 3.8 to 14.2, and the maximum of 14.2 is almost double the 3rd quartile value of 7.3. The other two acidity variables show a similar pattern, which suggests there may be some outliers at the high end.

  3. Residual sugar and chlorides show a similar pattern to acidity. For example, the maximum residual sugar is 65.8 while the 3rd quartile is only 9.9. The maximum for chlorides is 0.346 while the 3rd quartile is 0.05.

  4. The density range is quite narrow, from 0.9871 to 1.0390.

  5. Min pH is 2.72 and max is 3.82.

  6. Min alcohol is 8.0 and max is 14.2.
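The outlier suspicion in observations 2 and 3 can be checked roughly with Tukey's rule of thumb, which flags values above Q3 + 1.5 * IQR. This is a sketch that uses only the quartiles and maxima printed in the summary above; the 1.5 * IQR rule is my choice here, not necessarily the cutoff used for the plots.

```r
# Quartiles and maxima copied from the summary() output above.
stats <- data.frame(
  variable = c("fixed.acidity", "residual.sugar", "chlorides"),
  q1       = c(6.3, 1.7, 0.036),
  q3       = c(7.3, 9.9, 0.050),
  max      = c(14.2, 65.8, 0.346)
)

# Tukey's rule of thumb: values above Q3 + 1.5 * IQR are potential outliers.
stats$upper_fence      <- stats$q3 + 1.5 * (stats$q3 - stats$q1)
stats$max_beyond_fence <- stats$max > stats$upper_fence
print(stats)
```

For all three variables the maximum lies well beyond the upper fence, which is consistent with outliers at the high end.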

Univariate Plots Section

Plot: "Quality Distribution"

##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

Only 5 white wine samples have the best quality (9) and only 20 samples have the worst quality (3). The best wines may be rare and hard to find, and the worst ones may be due to production defects, which are also uncommon. The distribution is close to a normal distribution.
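Because the counts per quality level are printed above, the quality column can be rebuilt (up to row order) and its shares checked. A small sketch; the variable names `counts` and `quality` are mine.

```r
# Counts per quality level, copied from the table above.
counts  <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198, "7" = 880, "8" = 175, "9" = 5)

# Rebuild the quality column (row order is not recoverable, but it does not
# matter for the distribution).
quality <- rep(3:9, times = counts)

length(quality)   # number of samples
median(quality)   # median quality
mean(quality)     # mean quality

# Percentage of wines at each quality level.
round(100 * prop.table(table(quality)), 1)

# Share of wines with quality 5 or 6.
mean(quality %in% 5:6)
```

The length, median and mean agree with the `str()` and `summary()` output earlier in the report.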

## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

pH is approximately normally distributed, with a spike around pH 3.16-3.18.

Alcohol is rarely above 14.0% or below 8.5%. The spike is at around 9.5%.

Chlorides are rarely over 0.06 and there is a spike around the median value. If more outliers are trimmed, the distribution also looks close to normal.

Residual sugar has two big spikes, at 1-1.5 and 1.5-2.0, and a long tail on the right side.

I transformed the residual sugar scale to log10 and square root. On the log10 scale, the distribution looks bimodal.
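To show what the two transformations do to a long right tail, here is a sketch on the ten residual sugar values printed in the `str()` output. It uses base R only; in ggplot2 the equivalent would probably be `scale_x_log10()` or `scale_x_sqrt()` on the histogram, but the plotting code is not part of this report, so that is an assumption.

```r
# First ten residual.sugar values, copied from the str() output.
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

transformed <- data.frame(
  raw         = residual_sugar,
  log10_sugar = log10(residual_sugar),  # compresses large values strongly
  sqrt_sugar  = sqrt(residual_sugar)    # milder compression
)
print(round(transformed, 3))

# Spread of each scale: both transforms pull the right tail in.
sapply(transformed, function(x) diff(range(x)))
```

With the full column, `hist(log10(whitewines$residual.sugar))` would give a base R view of the same idea.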

I grouped the acidity-related attributes together to compare their distributions. Outliers are excluded in the plots above. They have similar patterns, all close to a normal distribution, especially fixed acidity.

I grouped the sulfur-related attributes together to compare their distributions as well. Outliers are excluded in the plots above. The patterns also look alike: all three are close to a normal distribution with a spike around their respective median values.

Univariate Analysis

What is the structure of your dataset?

There are 12 features in this dataset: fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol and quality. Quality is an integer; all other features are numeric. White wine quality ranges from 3 to 9, and the larger the number, the better the quality. The best and worst quality wines are both very few. The median quality is 6, and 2198 wine samples have this quality, which is about 45% of the whole dataset.

What is/are the main feature(s) of interest in your dataset?

The main feature of this dataset is quality. I will explore the relationships between quality and the other features and try to find out whether wine quality can be predicted from its chemical attributes.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Alcohol is the first feature to evaluate, followed by pH, fixed.acidity (and the other acidity attributes), total.sulfur.dioxide (and the other sulfur attributes) and chlorides, since they all have close to normal distributions.

Did you create any new variables from existing variables in the dataset?

No, not in this section. However, I will create a new variable that converts quality to a factor, so that I can use it as a categorical feature in plotting.
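The conversion itself is a one-liner. The sketch below rebuilds the quality column from the counts shown in the univariate section, since the full data frame is not reproduced here; the name `quality.factor` is the one used later in this report.

```r
# Quality column rebuilt from the per-level counts printed earlier.
whitewines <- data.frame(
  quality = rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))
)

# Categorical copy of quality, so plots get one group per quality level.
whitewines$quality.factor <- factor(whitewines$quality)

levels(whitewines$quality.factor)
table(whitewines$quality.factor)
```

Keeping the integer column alongside the factor means the numeric version is still available for correlations and the linear model.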

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Residual.sugar has a very long tail in its histogram, so I transformed it with log10 and square root to see whether I could get a distribution closer to normal. On the log10 scale, it shows a bimodal distribution with one peak at about 1.5 and the other at about 8.5.

Bivariate Plots Section

Scatter plot matrix

## Warning in loop_apply(n, do.ply): Removed 10 rows containing missing
## values (geom_point).

Higher quality wines tend to have higher alcohol content.

## Warning in loop_apply(n, do.ply): Removed 160 rows containing missing
## values (geom_point).

There is no visible correlation between quality and chlorides.

## Warning in loop_apply(n, do.ply): Removed 64 rows containing missing
## values (geom_point).

Quality decreases when density increases.

## Warning in loop_apply(n, do.ply): Removed 10 rows containing missing
## values (geom_point).

I can’t see much impact of pH on wine quality, so I would like to draw a histogram by quality.

## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

The pH distributions are very similar across qualities 4-8 (excluding qualities 3 and 9), so pH appears to have very little impact on quality.

## Warning in loop_apply(n, do.ply): Removed 61 rows containing missing
## values (geom_point).

There is no visible correlation between quality and residual sugar. Within each quality grade, however, more wines have residual sugar below 5 than above it.

## Warning in loop_apply(n, do.ply): Removed 59 rows containing missing
## values (geom_point).

## Warning in loop_apply(n, do.ply): Removed 49 rows containing non-finite
## values (stat_boxplot).

Except for wines with quality 3 and 4, higher quality wines tend to have lower total sulfur dioxide.

## Warning in loop_apply(n, do.ply): Removed 63 rows containing missing
## values (geom_point).

## Warning in loop_apply(n, do.ply): Removed 48 rows containing non-finite
## values (stat_boxplot).

To my surprise, qualities 6, 7 and 8 have very close median volatile acidity. It is not obvious that volatile acidity has much impact on wine quality.

After exploring the relationship between quality and 7 other major features, I continued to investigate the relationships between other feature pairs, such as density vs alcohol, density vs residual sugar, volatile acidity vs fixed acidity, etc.

Density decreases as alcohol increases, and increases as residual sugar increases. Although there are some spikes, overall residual sugar decreases as alcohol increases.

## Warning in loop_apply(n, do.ply): Removed 85 rows containing missing
## values (stat_smooth).
## Warning in loop_apply(n, do.ply): Removed 85 rows containing missing
## values (geom_point).
## Warning in loop_apply(n, do.ply): Removed 89 rows containing missing
## values (stat_smooth).
## Warning in loop_apply(n, do.ply): Removed 89 rows containing missing
## values (geom_point).

Free sulfur dioxide has a strong positive correlation with total sulfur dioxide. On the other hand, volatile acidity appears unrelated to fixed acidity.
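Relationships like these can be quantified with `cor()`. The sketch below runs it on only the ten rows visible in the `str()` output, so the coefficients are illustrative and not the values for the full data set; only the signs are worth reading.

```r
# First ten rows of four columns, copied from the str() output.
sample_rows <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129),
  residual.sugar       = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  alcohol              = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)

# Pearson correlation for two of the pairs discussed in this section.
r_sulfur        <- cor(sample_rows$free.sulfur.dioxide, sample_rows$total.sulfur.dioxide)
r_sugar_alcohol <- cor(sample_rows$residual.sugar, sample_rows$alcohol)

r_sulfur         # positive in this small sample
r_sugar_alcohol  # negative in this small sample

# Full correlation matrix for the sample.
round(cor(sample_rows), 2)
```

On the full data frame the same call would be `cor(whitewines$free.sulfur.dioxide, whitewines$total.sulfur.dioxide)`.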

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Quality correlates most strongly with alcohol and density. Quality also has some correlation with total sulfur dioxide.

  1. As alcohol content increases, wine quality gets better, and the relationship looks roughly linear. However, the worst quality wines (quality 3) have a higher median alcohol than quality 4 and 5, and the quality 4 median is also higher than that of quality 5. This may be because the number of samples in quality 3 and 4 is very small.

  2. As density increases, wine quality gets worse.

  3. As total sulfur dioxide increases, wine quality gets worse for qualities 5-9, but the correlation doesn’t look very strong.

  4. pH, fixed acidity and free sulfur dioxide have no visible correlation with quality, which surprises me.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Yes. I found the following interesting relationships:

  1. Density is strongly correlated with alcohol and residual sugar. As alcohol increases, density decreases. As residual sugar increases, density increases as well. During fermentation, sugar is converted into alcohol (and carbon dioxide); dissolved sugar makes wine denser, while alcohol is less dense than water, so this observation makes a lot of sense.

  2. Free sulfur dioxide increases as total sulfur dioxide increases and the correlation is strong, which also makes sense.

  3. Another surprise to me: volatile acidity appears to have no relationship with fixed acidity. I guess they are different types of acids, which can’t be transformed from one to another.

What was the strongest relationship you found?

Density and residual sugar.

Multivariate Plots Section

I added more features to my plots to observe the impact of two or more features on white wine quality. To avoid too many levels in the plots in this section, the subset whitewines_sub is used, since the best and worst quality categories have very few samples and can be treated as outliers.
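The subsetting code is not shown, so this is a sketch of one way `whitewines_sub` could be built, assuming it simply drops qualities 3 and 9. The quality column is rebuilt from the counts printed in the univariate section.

```r
# Quality column rebuilt from the per-level counts printed earlier.
whitewines <- data.frame(
  quality = rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))
)

# Assumption: the subset excludes the sparsest categories, quality 3 and 9.
whitewines_sub <- subset(whitewines, quality > 3 & quality < 9)

nrow(whitewines) - nrow(whitewines_sub)   # rows dropped
table(whitewines_sub$quality)             # remaining quality levels
```

Under this assumption only 25 of the 4898 samples are dropped, so the subset barely changes the overall picture while leaving five quality levels to color or facet by.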

## Warning in loop_apply(n, do.ply): Removed 46 rows containing missing
## values (geom_point).

The region with lower alcohol and higher total sulfur dioxide contains more lower quality wines, and the region with higher alcohol and lower total sulfur dioxide contains more higher quality wines.

## Warning in loop_apply(n, do.ply): Removed 44 rows containing missing
## values (geom_point).

The region with lower alcohol and higher fixed acidity contains more lower quality wines, and the region with higher alcohol and lower fixed acidity contains more higher quality wines.

## Warning in loop_apply(n, do.ply): Removed 47 rows containing missing
## values (geom_point).

The region with higher alcohol and lower residual sugar has a very dense cluster of high quality wines. At lower alcohol, however, lower quality wines are spread almost evenly across the whole residual sugar range.

pH doesn’t have a strong correlation with alcohol. Better quality wines have more points at the higher alcohol end, but within each quality the points are spread almost evenly across pH.

## Warning in loop_apply(n, do.ply): Removed 39 rows containing missing
## values (geom_point).

Free sulfur dioxide doesn’t have a strong correlation with alcohol.

## Warning in loop_apply(n, do.ply): Removed 50 rows containing missing
## values (geom_point).

Sulphates don’t have a strong correlation with alcohol, and they are not correlated with quality.

I added a new parameter, total.acidity, which is the sum of fixed acidity, volatile acidity and citric acid.

Median total acidity is very close across all wine qualities.
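A minimal sketch of how `total.acidity` can be derived, shown on the first three rows from the `str()` output rather than the full data frame:

```r
# First three rows of the acidity columns, copied from the str() output.
whitewines <- data.frame(
  fixed.acidity    = c(7, 6.3, 8.1),
  volatile.acidity = c(0.27, 0.30, 0.28),
  citric.acid      = c(0.36, 0.34, 0.40)
)

# New parameter: row-wise sum of the three acidity-related columns.
whitewines$total.acidity <- with(
  whitewines,
  fixed.acidity + volatile.acidity + citric.acid
)
print(whitewines)
```

Fixed acidity is roughly twenty times larger than the other two terms, so the sum is dominated by it.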

## Warning in loop_apply(n, do.ply): Removed 177 rows containing missing
## values (geom_point).

Plot: "pH Vs Total Acidity by Quality"

pH decreases as total acidity increases. Quality has no visible correlation with pH or total acidity.

## Warning in loop_apply(n, do.ply): Removed 177 rows containing missing
## values (geom_point).

I saw a distribution similar to fixed acidity vs alcohol by quality, so the new parameter doesn’t seem to add much value to the analysis here.

Linear model

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = whitewines)
## m2: lm(formula = quality ~ alcohol + total.sulfur.dioxide, data = whitewines)
## m3: lm(formula = quality ~ alcohol + total.sulfur.dioxide + fixed.acidity, 
##     data = whitewines)
## m4: lm(formula = quality ~ alcohol + total.sulfur.dioxide + residual.sugar, 
##     data = whitewines)
## 
## =============================================================
##                          m1        m2        m3        m4    
## -------------------------------------------------------------
## (Intercept)            2.582***  2.419***  2.911***  2.048***
##                       (0.098)   (0.133)   (0.167)   (0.139)  
## alcohol                0.313***  0.322***  0.317***  0.352***
##                       (0.009)   (0.010)   (0.010)   (0.011)  
## total.sulfur.dioxide             0.001     0.001*   -0.000   
##                                 (0.000)   (0.000)   (0.000)  
## fixed.acidity                             -0.066***          
##                                           (0.014)            
## residual.sugar                                       0.022***
##                                                     (0.003)  
## -------------------------------------------------------------
## R-squared                 0.190     0.190     0.194     0.202
## adj. R-squared            0.190     0.190     0.194     0.201
## sigma                     0.797     0.797     0.795     0.791
## F                      1146.395   575.100   393.081   412.870
## p                         0.000     0.000     0.000     0.000
## Log-likelihood        -5839.391 -5837.755 -5825.920 -5802.097
## Deviance               3112.257  3110.178  3095.184  3065.221
## AIC                   11684.782 11683.510 11661.839 11614.193
## BIC                   11704.272 11709.496 11694.322 11646.676
## N                      4898      4898      4898      4898    
## =============================================================

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Total sulfur dioxide and fixed acidity make the impact of alcohol on quality stronger.

  1. As total sulfur dioxide decreases and alcohol increases, quality gets better.

  2. As fixed acidity decreases and alcohol increases, quality gets better.

Were there any interesting or surprising interactions between features?

I added up the acid-related parameters to create a new parameter, total acidity. Total acidity has a strong negative correlation with pH, which makes sense.

I still can’t see much relationship between quality and pH, or between quality and free sulfur dioxide. After googling the chemical attributes of wine, I had expected those two features to correlate strongly with quality.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes. I created a linear model in several variants. However, the R-squared is very low, only about 0.19 to 0.20, so the models explain only about a fifth of the variance in quality. A simple linear model does not look like a good way to predict the quality of white wines.
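One way to see the limitation is to plug the observed alcohol range into model m1, using the coefficients printed in the table above. This is a hand-written sketch of the fitted line, not a call to `predict()` on the fitted object; the function name `m1_predict` is mine.

```r
# Coefficients of m1 (quality ~ alcohol), copied from the model table.
m1_intercept <- 2.582
m1_slope     <- 0.313

# Fitted line of m1 written out by hand.
m1_predict <- function(alcohol) m1_intercept + m1_slope * alcohol

# Observed alcohol range, from the summary() output.
alcohol_range <- c(min = 8.0, max = 14.2)
predicted     <- m1_predict(alcohol_range)
print(round(predicted, 2))
```

Over the whole observed alcohol range the fitted line only spans roughly 5.1 to 7.0, while the observed quality spans 3 to 9, so m1 can never reach the best or worst categories.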


Final Plots and Summary

Plot One

Description One

I used the added factor parameter quality.factor to re-draw the histogram, which looks better than the one using quality. Quality is approximately normally distributed, and the median quality accounts for almost half of the wines in the data set. As quality decreases or increases from the median, the number of wines drops quickly.

Plot Two

## Warning in loop_apply(n, do.ply): Removed 19 rows containing non-finite
## values (stat_boxplot).

Description Two

To my surprise, the worst quality wines (quality 3) have a higher median alcohol than quality 4 and 5, and the quality 4 median is also higher than that of quality 5. This may be because the number of samples in quality 3 and 4 is comparatively small. Or maybe other chemical components in those worst quality wines downgrade the quality even if the alcohol content is relatively high. For wines with quality better than 4, it is clear that wines with higher alcohol are more likely to have higher quality.

Plot Three

## Warning in loop_apply(n, do.ply): Removed 25 rows containing missing
## values (geom_point).

Description Three

Higher quality white wines tend to have lower total sulfur dioxide and higher alcohol: low quality (quality 4, 5, 6) white wines are concentrated in the region of high total sulfur dioxide and low alcohol, while high quality white wines are concentrated in the region of low total sulfur dioxide and high alcohol.


Reflection

I picked this dataset because I started to like drinking wine two years ago. I am really curious whether I can find some clues to tell good wines from bad ones based on their chemical attributes. This set has 4898 white wine samples, which is a good size for practice: not too big, but large enough to draw nice plots.

I started with each individual variable. Wine quality is approximately normally distributed, which makes sense, since most wines are of median quality and probably sell at an affordable price, so demand for this type of wine is the largest. Close to normal distributions can also be seen in the following features: fixed acidity, volatile acidity, citric acid, free sulfur dioxide, total sulfur dioxide, density and pH. So I initially thought there should be a strong linear relationship between quality and the other features.

Then I paired the other features with quality to explore the relationships. To my surprise, I only saw a strong correlation between quality and alcohol. There are visible but not very strong correlations between quality and fixed acidity, and between quality and total sulfur dioxide. pH and residual sugar were my top candidates but, to my disappointment, I couldn’t find any visible correlation there. I also found that alcohol is negatively correlated with both residual sugar and density, and that residual sugar and density are strongly correlated.

Finally, I added a third feature to the pairs. I did find that total sulfur dioxide and fixed acidity enhance the correlation between alcohol and quality. So I tried to build a linear model using alcohol, total sulfur dioxide, fixed acidity and residual sugar. The model is not very successful, since the R-squared is very low, only about 0.19. That bothers me a lot.

Here are some of my thoughts about this dataset and possible improvements beyond the scope of this project:

  1. The wine samples in this dataset are all from a particular area in Portugal, which may introduce some bias related to that region.

  2. The sample sizes for the best and worst qualities are very small, which makes modeling at the high end and low end less accurate.

  3. pH and residual sugar should play a role in a wine’s taste. However, I failed to find a correlation in this dataset, which tells me that there may be something missing in this dataset, or that these features could be transformed in a way that represents the relationship better.

  4. Quality is scored on a scale from 0 to 10; in this dataset it ranges from 3 to 9. It looks like a categorical feature to me, so a linear model may not be the best way to predict it. Other models, such as a classification model, could be used to classify wine quality by its chemical properties.